1. Dataset 


* A dataset is a particular instance of data
that is used for analysis or model building at
any given time.

* Datasets come in different flavors, such as
numerical data, categorical data, text data,
image data, voice data, and video data.

* For beginner data science projects, the
most popular type of dataset contains numerical
data and is typically stored in a
comma-separated values (CSV) file.
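
As a minimal sketch of what a numerical CSV dataset looks like once it is loaded, the snippet below parses CSV text with Python's standard `csv` module. The column names and values are made up for illustration, and the helper `load_numeric_csv` is a hypothetical name, not part of any library.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file on disk.
raw_csv = """age,size_sqft,price
10,1500,250000
25,1200,180000
5,2000,320000
"""

def load_numeric_csv(text):
    """Parse CSV text into a list of dicts whose values are floats."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: float(value) for name, value in row.items()} for row in reader]

rows = load_numeric_csv(raw_csv)
print(len(rows), rows[0])
```

Each row becomes one sample and each column one feature, which is the tabular shape that most of the later steps assume.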


2. Data Wrangling

* Data wrangling is the process of converting
data from its raw form to a tidy form ready for
analysis.

* Data wrangling is an important step in data
preprocessing and includes several processes,
such as data importing, data cleaning, data
structuring, string processing, HTML parsing,
handling dates and times, handling missing
data, and text mining.
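
A few of these wrangling steps (string processing, handling dates, and marking missing data) can be sketched with the standard library alone. The records, field names, and the helpers `parse_date` and `tidy` below are hypothetical and only meant to illustrate the idea of going from raw to tidy data.

```python
from datetime import datetime

# Hypothetical raw records with untidy strings, mixed date formats, and gaps.
raw_records = [
    {"name": "  Alice ", "joined": "2021-03-15", "score": "88"},
    {"name": "BOB", "joined": "15/04/2021", "score": ""},
    {"name": "carol", "joined": "2021-05-01", "score": "92.5"},
]

def parse_date(text):
    """Try each known date format in turn; return None if none match."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

def tidy(record):
    """Return a cleaned copy of one raw record."""
    score = record["score"].strip()
    return {
        "name": record["name"].strip().title(),
        "joined": parse_date(record["joined"].strip()),
        "score": float(score) if score else None,  # None marks a missing value
    }

tidy_records = [tidy(r) for r in raw_records]
print(tidy_records[1])
```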

3. Visualization

* Data visualization is one of the main tools
used to analyze and study relationships between
different variables.

* Data visualization (e.g., scatter plots, line
graphs, bar plots, histograms, Q-Q plots, smooth
densities, box plots, pair plots, heat maps)
can be used for descriptive analytics.

* Data visualization is also used in machine
learning for data preprocessing and analysis,
feature selection, model building, model
testing, and model evaluation.

4. Outliers

* An outlier is a data point that is very
different from the rest of the dataset.

* Outliers are very common and are expected
in large datasets.

* One common way to detect outliers in a
dataset is by using a box plot.

* Outliers can significantly degrade the
predictive power of a machine learning model.

* Advanced methods for dealing with outliers
include the RANSAC (random sample consensus)
method.
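
A box plot conventionally flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule numerically, assuming the usual multiplier `k=1.5` and the "inclusive" quantile method of `statistics.quantiles`; the function name `iqr_outliers` and the sample data are made up for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 14, 11, 95]
print(iqr_outliers(data))  # [95]
```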

5. Data Imputation

* Most datasets contain missing values.
However, removing samples or dropping entire
feature columns is often not feasible because
we might lose too much valuable data.

* Instead, we can use different imputation
techniques to estimate the missing values from
the other training samples in our dataset.

* One of the most common imputation
techniques is mean imputation, where we
simply replace the missing value with the mean
value of the entire feature column.
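
Mean imputation can be sketched in a few lines. The example assumes missing values are represented as `None`; the function name `mean_impute` and the sample column are hypothetical.

```python
def mean_impute(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

ages = [20.0, None, 30.0, 40.0, None]
print(mean_impute(ages))  # [20.0, 30.0, 30.0, 40.0, 30.0]
```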


6. Data Scaling 


* Scaling your features can help improve the
quality and predictive power of your model.

* Without feature scaling, many models (for
example, distance-based ones) can be biased
toward the features with the largest numeric
ranges.

* In order to bring features to the same
scale, we can use either normalization or
standardization of features.

7. Data Partitioning

* In machine learning, the dataset is often
partitioned into training and testing sets.

* The model is trained on the training
dataset and then tested on the testing dataset.

* The testing dataset thus acts as the unseen
dataset, which can be used to estimate the
generalization error (the error expected when
the model is applied to a real-world dataset
after the model has been deployed).
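
A simple random partition can be sketched like this. The 25% test fraction and the fixed seed are arbitrary assumptions, and `train_test_split` here is a hand-written helper for illustration, not a library function.

```python
import random

def train_test_split(samples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the samples and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(20))
train, test = train_test_split(samples)
print(len(train), len(test))  # 15 5
```

The test samples are held back until the model has been fitted, so the score on them approximates performance on unseen data.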

8. Supervised Learning

* These are machine learning algorithms that
learn by studying the relationship between the
feature variables and the known target variable.

Supervised learning has two subcategories:

a) Continuous Target Variables (regression)

* Linear Regression, K-Neighbors Regression
(KNR), and Support Vector Regression (SVR).

b) Discrete Target Variables (classification)

* Logistic Regression classifier, Support Vector
Machines (SVM), and Decision Tree classifier.

9. Unsupervised Learning

* In unsupervised learning, we deal with
unlabeled data or data of unknown structure.

* Using unsupervised learning techniques, we are
able to explore the structure of our data to extract
meaningful information without the guidance of a
known outcome variable or reward function.

* K-means clustering is an example of an
unsupervised learning algorithm.
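
To make the k-means idea concrete, here is a minimal sketch of the standard assign-then-update loop (Lloyd's algorithm) on one-dimensional points. The starting centroids, the iteration count, and the function name `kmeans_1d` are assumptions chosen for illustration; real implementations work in many dimensions and pick initial centroids more carefully.

```python
def kmeans_1d(points, centroids, n_iter=10):
    """Lloyd's algorithm on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids)  # [1.5, 10.0]
```

No labels are supplied anywhere: the two groups emerge from the distances between the points alone.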

10. Reinforcement Learning

* Reinforcement learning (RL) is a type of
machine learning technique that enables an agent
to learn in an interactive environment by trial and
error, using feedback from its own actions and
experiences.

* Reinforcement learning uses rewards and
punishments as signals for positive and negative
behavior.
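
One of the simplest settings that shows trial-and-error learning is a multi-armed bandit with an epsilon-greedy agent. The sketch below is an illustration under strong assumptions: the environment returns a fixed reward per action (negative values standing in for punishment), and `epsilon`, `n_steps`, and the function name are arbitrary choices rather than anything prescribed by the text.

```python
import random

def epsilon_greedy_bandit(rewards, n_steps=1000, epsilon=0.1, seed=0):
    """Estimate the value of each action by trial and error."""
    rng = random.Random(seed)
    q = [0.0] * len(rewards)    # estimated value of each action
    counts = [0] * len(rewards)
    for _ in range(n_steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(rewards))                  # explore
        else:
            action = max(range(len(rewards)), key=q.__getitem__)  # exploit
        reward = rewards[action]             # feedback from the environment
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]        # running mean
    return q

# Action 0 is punished, action 1 is rewarded, action 2 is neutral.
q_values = epsilon_greedy_bandit([-1.0, 1.0, 0.0])
best_action = max(range(len(q_values)), key=q_values.__getitem__)
print(best_action)
```

The agent is never told which action is best; it discovers this from the rewards and punishments its own choices produce.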

11. Cross-validation

* Cross-validation is a method of evaluating a
machine learning model’s performance across
random samples of the dataset.

* In k-fold cross-validation, the dataset is
randomly partitioned into k folds. In each round,
one fold serves as the testing set and the
remaining k-1 folds form the training set.

* The model is trained on the training set and
evaluated on the testing set. The process is repeated
k times, so that each fold is used for testing once.

* The average training and testing scores are then
calculated by averaging over the k folds.
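
The fold bookkeeping can be sketched as below. The helper names are hypothetical, and `score_fn` stands in for whatever "train on these indices, score on those" routine a real project would supply.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(n_samples, score_fn, k=5):
    """Average a caller-supplied score over the k folds."""
    scores = [score_fn(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_indices(10, k=5))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 5 8 2
```

Every sample appears in exactly one testing fold, so each data point contributes to the averaged test score once.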

12. Bias-variance Tradeoff

* A model with high bias and low variance
makes stronger assumptions about the form of the
target function (and tends to underfit), while a
model with high variance and low bias overfits
the training dataset.

* The parameters of the model should be tuned to
get the best-fit model, the one that performs best in
production.

13. Principal Component Analysis (PCA)

* Principal Component Analysis (PCA) is a
statistical method that is used for feature
extraction. It is used for high-dimensional and
correlated data.

* The basic idea of PCA is to transform the
original space of features into the space of the
principal components.
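
In symbols, one common way to write this transformation is the following. The notation is an assumption introduced here: X is the mean-centered data matrix with n samples and d features, and k is the number of components kept.

```latex
\Sigma = \frac{1}{n-1} X^{\top} X, \qquad
\Sigma\, w_i = \lambda_i\, w_i, \qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d, \\
W_k = [\, w_1 \;\; w_2 \;\; \cdots \;\; w_k \,], \qquad
Z = X W_k
```

Here the eigenvectors w_i of the covariance matrix are the principal components, and Z holds the data expressed in the space of the first k of them.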

14. Linear Discriminant Analysis (LDA)

* Linear Discriminant Analysis is a dimensionality
reduction technique that is commonly used for
supervised classification problems.

* It is used for modeling differences between
groups, i.e., separating two or more classes.

* Just like PCA, it is used to project features
from a higher-dimensional space into a
lower-dimensional space.

15) Model Parameters and Hyperparameters

Model Parameters

* These are the parameters in the model that
must be determined using the training dataset.
These are the fitted parameters.

* For example, suppose we have a model such
as house price = a + b*(age) + c*(size)
to estimate the cost of houses based on the age
of the house and its size (in square feet). Then
a, b, and c are our model, or fitted, parameters.


Hyperparameters 


* These are adjustable parameters that must be
tuned to obtain a model with optimal performance.

Some examples of hyperparameters in
machine learning:

  * Learning rate
  * Number of epochs
  * Regularization constant
  * Number of branches in a decision tree
  * Number of clusters in a clustering algorithm
    (like k-means)

* It is important that, during training, the
hyperparameters be tuned to obtain the model with
the best performance (with the best-fitted
parameters).
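
The distinction can be seen in a small gradient-descent fit of a one-feature line y = a + b*x. This is a sketch under assumptions: the data are synthetic, the loss is mean squared error, and the default `learning_rate` and `n_epochs` values were chosen by hand for this toy problem. Here `a` and `b` are model parameters learned from the data, while `learning_rate` and `n_epochs` are hyperparameters set before training.

```python
def fit_line(xs, ys, learning_rate=0.05, n_epochs=2000):
    """Fit y = a + b*x by batch gradient descent on the mean squared error."""
    a, b = 0.0, 0.0                      # model parameters, learned from data
    n = len(xs)
    for _ in range(n_epochs):            # n_epochs is a hyperparameter
        errors = [(a + b * x) - y for x, y in zip(xs, ys)]
        grad_a = 2 * sum(errors) / n
        grad_b = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        a -= learning_rate * grad_a      # learning_rate is a hyperparameter
        b -= learning_rate * grad_b
    return a, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 1 + 2x
a, b = fit_line(xs, ys)
print(round(a, 4), round(b, 4))  # 1.0 2.0
```

Changing the learning rate or the number of epochs changes how well (or whether) the fitted parameters converge, which is why those settings need tuning.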


16) Evaluation Metrics

* In machine learning (predictive analytics), there
are several metrics that can be used for model
evaluation.

* A supervised learning (discrete target) model,
also referred to as a classification model, can be
evaluated using metrics such as accuracy,
precision, recall, F1 score, and the area under
the ROC curve (AUC).
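
Four of these metrics follow directly from the counts of true positives, false positives, and false negatives. The sketch below assumes a binary problem with label 1 as the positive class; the function name and the sample labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With 3 true positives, 1 false positive, and 1 false negative, all four metrics come out to 0.75 on this toy example.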

17) Math Concepts

a) Basic Calculus

* Most machine learning models are built with
a dataset having several features or predictors.
Hence, familiarity with multivariable calculus is
extremely important for building a machine
learning model. Here are the topics you need to
be familiar with:

b) Basic Linear Algebra

* Linear algebra is the most important math
skill in machine learning. A dataset is
represented as a matrix. Linear algebra is used
in data preprocessing, data transformation,
dimensionality reduction, and model
evaluation. Here are the topics you need to be
familiar with:

c) Optimization Methods

* Most machine learning algorithms perform
predictive modeling by minimizing an objective
function, thereby learning the weights that must
be applied to the testing data in order to obtain
the predicted labels. Here are the topics you
need to be familiar with:

18) Statistics and Probability Concepts

* Statistics and probability are used for
visualization of features, data preprocessing,
feature transformation, data imputation,
dimensionality reduction, feature engineering,
model evaluation, etc. Here are the topics you
need to be familiar with:

Mean, median, mode, standard
deviation / variance, correlation coefficient and
the covariance matrix, probability distributions
(Binomial, Poisson, Normal), p-value, Bayes'
theorem (precision, recall, positive predictive
value, negative predictive value, confusion
matrix, ROC curve), central limit theorem, R²
score, mean squared error (MSE), A/B testing,
Monte Carlo simulation
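
Several of the descriptive statistics in this list are available in Python's standard `statistics` module. The sketch below uses made-up sample values and a hand-written `pearson_r` helper (a hypothetical name) for the correlation coefficient.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

values = [2, 4, 4, 4, 5, 5, 7, 9]
print(statistics.mean(values))    # 5
print(statistics.median(values))  # 4.5
print(statistics.mode(values))    # 4
print(statistics.pstdev(values))  # 2.0 (population standard deviation)
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```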


19) Regularization 


* Regularization is a technique used to reduce
errors by fitting the function appropriately on the
given training set and avoiding overfitting.

The commonly used regularization techniques are:

1) L1 regularization: LASSO (Least Absolute Shrinkage and Selection Operator)
2) L2 regularization: Ridge regression
3) Dropout regularization

* Lasso regression adds the “absolute value of
magnitude” of the coefficients as a penalty term to
the loss function (L).

* Ridge regression adds the “squared magnitude” of
the coefficients as a penalty term to the loss
function (L).
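
Written out, the two penalties look as follows. The symbols are assumptions introduced here for illustration: w_j are the model coefficients, p is their number, and λ ≥ 0 is the regularization constant that controls the strength of the penalty.

```latex
L_{\text{lasso}}(w) = L(w) + \lambda \sum_{j=1}^{p} \lvert w_j \rvert
\qquad\qquad
L_{\text{ridge}}(w) = L(w) + \lambda \sum_{j=1}^{p} w_j^{2}
```

Setting λ = 0 recovers the unpenalized loss L, while larger values of λ shrink the coefficients more strongly.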